{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas聚合操作及透视表、交叉表操作\n",
    "## 一.Pandas分组聚合\n",
    "### 1.groupby方法\n",
    "### 2.agg方法\n",
    "### 3.apply方法\n",
    "### 4.transform方法\n",
    "## 二.Pandas透视表与交叉表\n",
    "### 1.透视表的创建\n",
    "### 2.交叉表的创建\n",
    "## 知识目标：掌握Pandas分组聚合及透视表、交叉表的原理\n",
    "## 技能目标：掌握Pandas分组聚合常用方法进行组内计算\n",
    "## 重点：Pandas分组聚合方法\n",
    "## 难点：透视表、交叉表的创建\n",
    "========================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一.Pandas分组聚合\n",
    "在Pandas中，分组是指使用特定的条件将原数据划分为多个组；聚合就是对每个组中的数据执行某些操作（如均值、求和等聚合操作，或转换操作等），最后将计算的结果进行整合。\n",
    "分组与聚合过程示意图：（拆分-应用-合并）见word文档\n",
    "### 1.使用groupby方法拆分数据\n",
    "\n",
    "通过groupby()方法将数据集按照某些标准划分成若干个组。\n",
    "\n",
    "语法：groupby(by=None, axis=0, level=None, as_index=True, sort=True,group_keys=True, squeeze=False, observed=False, **kwargs)\n",
    "#by：用于确定进行分组的依据。\n",
    "#axis：表示分组轴的方向。\n",
    "#as_index：表示聚合后的标签是否以DataFrame索引形式输出，默认为True。\n",
    "#sort：表示是否对分组标签进行排序，接收布尔值，默认为True。\n",
    "\n",
    "groupby()方法会返回一个GroupBy对象，该对象实际上并没有进行任何计算，只是包含一些关于分组键的中间数据而已。\n",
    "#使用Series调用groupby()方法返回的是SeriesGroupBy对象。\n",
    "#使用DataFrame调用groupby()方法返回的是DataFrameGroupBy对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例1：读取“超市营业额2”数据，并将数据按照“柜台”进行分组。\n",
    "df = pd.read_excel(r'D:\\pyData\\超市营业额2.xlsx')\n",
    "#分组后的结果并不能直接查看，而是被存在内存中，输出的是DataFrameGroupBy或SeriesGroupBy对象的内存地址。\n",
    "group_obj = df.groupby(by='柜台')\n",
    "group1= group_obj['交易额']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#如果想查看DataFrameGroupBy的具体内容，可以使用for循环遍历DataFrameGroupBy对象。\n",
    "for i in group_obj:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例2：按照“柜台”进行分组后，求各个组的交易总额\n",
    "group1.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "练一练：\n",
    "\n",
    "1.查看不同时段的交易平均额\n",
    "\n",
    "2.每个员工不同时段的交易总额"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(by='时段')['交易额'].mean()\n",
    "df.groupby(by=['姓名','时段'])['交易额'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.使用agg方法聚合数据\n",
    "\n",
    "agg、aggregate方法都支持对指定的列应用某些函数进行聚合，如果内置方法无法满足聚合要求时，则可以自定义函数，将它作为参数传给agg()方法，实现Pandas对象的聚合运算。\n",
    "\n",
    "语法：agg（func, axis = 0,* args,** kwargs ）\n",
    "#func：表示用于汇总数据的函数，可以为单个函数或函数列表。\n",
    "#axis：表示函数作用于轴的方向，0或index表示将函数应用到每一列；1或columns表示将函数应用到每一行，该参数的默认值为0。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例3：统计每个员工的交易总额和交易日期内的最大值（不用字段的不同统计量）\n",
    "df.groupby(by='姓名').agg({'交易额':np.sum,'日期':np.max}) #列名当作键，函数当作值，np.sum还可以写为'sum',np.max写为'max'\n",
    "df.groupby(by='姓名').agg({'交易额':'sum','日期':['min','max']})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例4：统计每个员工交易额的极差值（用自定义函数）\n",
    "def maxmin(data):\n",
    "    s=data.max()-data.min()\n",
    "    return s\n",
    "df.groupby(by='姓名')['交易额'].agg(maxmin)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考：用lambda表达式统计每个员工交易额的极差值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.使用apply方法聚合数据\n",
    "\n",
    "apply方法与agg方法类似，能够应用于每一列，不同之处在于，apply默认传入整个DataFrame\n",
    "\n",
    "语法：apply(func, axis=0, broadcast=None, raw=False, reduce=None,result_type=None, args=(), **kwds)\n",
    "#func：表示应用于某一行或某一列的函数。\n",
    "#axis：表示函数操作的轴向。\n",
    "#broadcast：表示是否将数据进行广播。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例5：对比agg()与apply()的区别，查看餐饮订单按order_id分组后的单价总和销量总和\n",
    "from sqlalchemy import create_engine\n",
    "engine = create_engine('mysql+pymysql://root:123456@127.0.0.1:3306/dsj?charset=utf8')\n",
    "detail = pd.read_sql('meal_order_detail1',con=engine)\n",
    "#apply方法与agg方法类似，能够应用于每一列，不同之处在于，apply默认传入整个DataFrame\n",
    "detail[['order_id','counts','amounts']].groupby(by='order_id').agg(lambda x: x.sum())\n",
    "detail[['order_id','counts','amounts']].groupby(by='order_id').apply(lambda x: x.sum())\n",
    "\n",
    "#使用apply()方法对groupby对象进行聚合操作和agg()方法相同，但agg()方法能够实现对不同的列应用不同的函数，而apply()方法不行。\n",
    "df.groupby(by='姓名').apply({'交易额':np.sum,'日期':np.max})#会报错，apply不支持字典类型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.transform方法聚合数据\n",
    "如果希望聚合后的数据与原数据保持一样的形状，那么可以通过transfrom()方法实现。transfrom()方法只有一个参数“func”。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例6：对比agg()与transform(),查看餐饮订单按order_id分组后的单价总和\n",
    "detail.groupby(by='order_id')['amounts'].agg('sum')\n",
    "detail.groupby(by='order_id')['amounts'].transform('sum')  #观察Length"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考：检验order_id为774的订单，单价总和是否为314？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "detail.loc[detail['order_id']=='774']['amounts'].agg('sum')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考：求每组订单中，各菜品的单价占每组总费用的多少？\n",
    "\n",
    "解题思路：\n",
    "\n",
    "1.写出占比公式 = ’amounts’ / [ ’amounts’].sum()\n",
    "\n",
    "2.思考每组总费用用agg()聚合还是transform()聚合求得？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "detail['Percent_of_Order'] = detail['amounts'] / detail.groupby(by='order_id')['amounts'].transform('sum')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二.Pandas透视表与交叉表\n",
    "数据透视表是数据分析中常见的工具之一，根据一个或多个键值对数据进行聚合，根据行或列的分组键将数据划分到各个区域。\n",
    "### 1.使用pivot_table函数创建透视表\n",
    "\n",
    "语法：pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')\n",
    "#data：表示创建透视表的源数据\n",
    "#values：用于指定要聚合的数据列名，默认使用全部数据\n",
    "#index：行分组键，用来指定哪一列数据作为结果DataFrame的行索引\n",
    "#colunms：列分组键，用来指定哪一列数据作为结果DataFrame的列名\n",
    "#aggfunc：用来指定数据的聚合方式，如果不指定函数，默认mean\n",
    "#fill_value：用来指定把透视表中的缺失值替换为什么值\n",
    "#margins：表示汇总功能的开关，如果为True，结果会出现名为“ALL”的行和列，默认为False\n",
    "#dropna：表示是否删除缺失值,默认为False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例7：读取“超市营业额2”数据，创建能查看每人每天营业总额的数据透视表\n",
    "pd.pivot_table(df,values='交易额',index='姓名',columns='日期',aggfunc='sum').iloc[:,:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考：\n",
    "\n",
    "1.如何显示汇总ALL值\n",
    "\n",
    "2.将例7透视表中出现NAN值，全部用0替换"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.使用crosstab()函数创建交叉表\n",
    "交叉表是一种特殊的透视表，主要用于统计频次。pandas提供crosstab()函数，根据一个DataFrame对象中的数据生成交叉表，返回新的DataFrame。\n",
    "\n",
    "crosstab()语法：\n",
    "pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)\n",
    "#aggfunc：如果不指定函数，默认统计次数\n",
    "#rownames，colnames：行索引及列索引的名字"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例8：创建能查看每人每天上班次数的交叉表\n",
    "pd.crosstab(index=df['姓名'],columns=df['日期']).iloc[:,:5]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
